Abstract. A pedigree is a directed graph that describes how individuals
are related through ancestry in a sexually-reproducing population.
In this paper we explore the question of whether one can
reconstruct a pedigree by just observing sequence data for present 
day individuals. This is motivated by the increasing availability of 
genomic sequences, but in this paper we take a more theoretical 
approach and consider what models of sequence evolution might 
allow pedigree reconstruction (given sufficiently long sequences). 
Our results complement recent work that showed that pedigree re- 
construction may be fundamentally impossible if one uses just the 
degrees of relatedness between different extant individuals. We find 
that for certain stochastic processes, pedigrees can be recovered up 
to isomorphism from sufficiently long sequences. 



1. Introduction 

Since earliest civilisation people have been concerned with record- 
ing, deciphering and resolving their ancestry. The concept of a 'family 
tree' is widely familiar (even though the ancestry of an individual can- 
not remain a tree for too many generations into the past) and there 
are many methods for deciphering ancestry back several generations. 
Mostly these are somewhat ad-hoc, based on comparing and combining 
overlapping ancestries, oral and written records. 

However in recent decades the concept of deeper ancestry has become 
topical in molecular evolution. Firstly, the 'Out-of-Africa' hypothesis
[1], now widely accepted, suggests that all extant humans are descen- 
dants of a relatively small population that migrated (possibly multiple 
times) out of Africa around 150,000-200,000 years ago. Secondly, re- 
cent theoretical work |^ suggests that most of the human population is 
likely to have common ancestors much more recently (thousands rather 
than hundreds of thousands of years ago). Thirdly, since the sequen- 
cing of the complete human genome in 2001, [5], [TT] and subsequent 
improvements in the economics and speed of sequencing technology, it
is quite possible that complete (or near-complete) genomic sequences 
for all individuals in a population could be available in the near future. 

These factors immediately suggest the question: what would a very 
large amount of genomic data tell us about the ancestry of a popu- 
lation? Clearly one can easily decide who are closely related (siblings, 
cousins etc), but how far back in time might one be able to reconstruct 
an accurate ancestry? To date, little is known about what is needed 
in order to formally reconstruct a pedigree (a graph that describes 
ancestry - defined formally below) though some initial results were 
presented in [H]. This is in marked contrast to another field in molec- 
ular evolution, namely phylogenetics, where there is a well-developed 
theory for reconstructing evolutionary ('phylogenetic') trees on species 
from the genetic sequences of present-day species. In that setting
genetic data is often highly informative for reconstructing detailed
relationships between species deep into the past (tens or hundreds of
millions of years). Such data can also be informative at short time
frames when studying rapidly evolving organisms (such as HIV).

However in phylogenetics the underlying graph is a tree, while in 
a pedigree it is a more 'tangled' type of directed graph. Moreover, 
the number of vertices in a tree is linearly related to the number of 
leaves (which represent the extant species on which we have informa- 
tion) while for a pedigree the number of vertices (individuals) can keep 
growing as we go further back in time. 

In this paper we continue the analysis started in [8] and attempt to 
determine models under which pedigrees might be reconstructed from 
sufficient data. We should point out that there is a well-developed 
statistical theory for pedigrees [TU], but this deals with different sorts of 
questions than pedigree reconstruction, such as estimating an ancestral 
state in a known pedigree. 

In [8] and [9], pedigrees were considered mainly from a combina- 
torial perspective. A question considered in both these papers was 
how best to construct pedigrees from certain combinatorial information 
about them, such as sets of distances between individuals, pedigrees on 
sub-populations, and so on. Several examples and counterexamples to 
combinatorial identifiability questions were presented. It seemed that 
constructing pedigrees would be a difficult task, if at all possible, and 
some of our intuition derived from phylogenetic trees would not carry 
over to pedigrees. 

A purpose of this paper is to consider pedigrees from a more sto- 
chastic perspective. We consider several stochastic models of evolution 
on a pedigree, that is, mechanisms by which individuals may inherit 
sequence information from their parents. We consider the fundamental
theoretical question: is the sequence information available in living
individuals in a population sufficient to construct the pedigree of the
population, or might there instead be portions of a pedigree that will
always remain ghosts, unable to be clearly resolved regardless of how
much sequence data one has on extant individuals? More formally, we 
are interested in whether non-isomorphic pedigrees could produce the 
same joint distribution of sequence information for living individuals. 
We begin with some combinatorial preliminaries and enumerate the 
number of distinct pedigrees to strengthen an earlier lower bound on 
the number of segregating sites that was derived in [5]. 

2. Definitions and preliminaries 

Mostly we follow the notation of [8]. Unless stated otherwise we 
will assume all (directed or undirected) graphs are finite, simple and 
without loops. A general pedigree is a directed acyclic graph P = (V, A) 
in which V can be written as the disjoint union of two subsets M and 
F ('Male' and 'Female'), and where each vertex has either no incoming
arc or two incoming arcs, with one from a vertex in M and the other
from a vertex in F. The vertices with no incoming arcs are called the
founder vertices.

In representing ancestry an arc (u, v) of P denotes that v is a child
(offspring) of u (equivalently, u is a parent of v), and the conditions 
defining a pedigree simply state that each individual (not in the found- 
ing population) has a male and female parent, and that there is an 
underlying temporal ordering (acyclicity). 

In Figure 1 a general pedigree is shown on the left.

Given a directed graph G = (V, A) let M(G) = (V, E) be the graph
on V whose edge set consists of all pairs {m, f} for which there exists
$w \in V$ with $(m, w) \in A$ and $(f, w) \in A$. In the case where G is
a 'food web', M(G) is known as the 'competition graph' (see [B]). However
in our setting, if G is a pedigree, then M(G) is the 'mate graph' of G,
where a pair of individuals form an edge if they have at least one child.

Lemma 1. A directed graph G = (V, A) is a pedigree if and only if
(i) G is acyclic, (ii) M(G) is bipartite, and (iii) no vertex of G has
just one incoming arc. In particular it can be determined in polynomial
time (in |V|) whether or not a directed graph is a pedigree.

Figure 1. A general pedigree on X = {a} (left) and a
simple pedigree with constant population size on X =
{a, b, c} (right).

Proof. Conditions (i)-(iii) clearly hold if G is a pedigree. Conversely,
if M(G) is bipartite, V can be properly 2-coloured, with colour set
{M, F}, and so we can write V as the disjoint union of two sets M, F
so that each vertex with at least two incoming edges has exactly two
incoming edges, one from a vertex in M and one from a vertex in
F. Condition (iii) excludes the possibility of just one incoming edge,
and so G is a pedigree. For the second claim, observe that the three
conditions (i)-(iii) can all be established in polynomial time. □
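
The three conditions of Lemma 1 translate directly into a polynomial-time test. The following is a minimal sketch, not taken from the paper: the function name `is_pedigree` and the representation of a digraph as a vertex list plus (parent, child) arcs are assumptions made for illustration.

```python
from collections import deque

def is_pedigree(vertices, arcs):
    """Check conditions (i)-(iii) of Lemma 1; arcs are (parent, child) pairs."""
    parents = {v: set() for v in vertices}
    children = {v: set() for v in vertices}
    for u, v in arcs:
        if u == v:
            return False  # graphs are assumed to be without loops
        parents[v].add(u)
        children[u].add(v)

    # (iii) no vertex has exactly one incoming arc
    if any(len(p) == 1 for p in parents.values()):
        return False

    # (i) acyclicity, via Kahn's topological-sort algorithm
    indeg = {v: len(parents[v]) for v in vertices}
    queue = deque(v for v in vertices if indeg[v] == 0)
    seen = 0
    while queue:
        u = queue.popleft()
        seen += 1
        for w in children[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    if seen != len(vertices):
        return False

    # (ii) build the mate graph M(G) and try to 2-colour it
    mates = {v: set() for v in vertices}
    for v in vertices:
        for a in parents[v]:
            for b in parents[v]:
                if a != b:
                    mates[a].add(b)
    colour = {}
    for s in vertices:
        if s in colour:
            continue
        colour[s] = 0
        stack = [s]
        while stack:
            a = stack.pop()
            for b in mates[a]:
                if b not in colour:
                    colour[b] = 1 - colour[a]
                    stack.append(b)
                elif colour[b] == colour[a]:
                    return False  # odd cycle: M(G) is not bipartite
    return True
```

A vertex with three or more parents is rejected by the bipartiteness test, since its parents form a triangle in the mate graph.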

The set of vertices that have no outgoing arcs is denoted $X_0$, and for
a particular distinguished subset X of $X_0$ (called the extant individuals)
we refer to P = (V, A) as a pedigree on X. We assume that the vertices in
X are labelled, and other vertices are unlabelled. Two pedigrees on X
are isomorphic if there is a digraph isomorphism between them that
fixes each element of X.

We note in passing that in [8] it was sometimes assumed that the
decomposition (M, F) of V was known, as this is not necessarily uniquely
determined just by P; this in turn also allows a more restrictive
definition of isomorphism (called 'gender-isomorphism') in which the
digraph isomorphism is required to map M (resp. F) vertices to M
(resp. F) vertices. However we do not require or invoke this additional
structure in the current paper.

A simple pedigree is a pedigree in which the vertex set of the pedigree
is a disjoint union of $X_i$, $0 \le i \le d$, and every arc $(u, v)$ has
its tail $u$ in $X_i$ and its head $v$ in $X_{i-1}$ for some $i > 0$. In
this case, $X_0$ is the set of extant vertices, $X_d$ is the set of
founders, and $d$ is the depth of the pedigree. In [8] and [9], the term
'discrete generation pedigree' was used instead of the term 'simple
pedigree'. In simple pedigrees with a constant population size, all $X_i$
have the same cardinality. In Figure 1 a simple pedigree with a constant
population size is shown on the right.

The amount of information required to accurately reconstruct a pedigree
on a set of size n, and up to depth d, is clearly bounded below
by some increasing function of the number of distinct (mutually
non-isomorphic) simple pedigrees with a constant population size n and of
depth d. Let this number be $f(n, d)$. We first describe a lower bound
on $f(n, d)$, providing a slightly stronger bound than [8].

Let $X_0 = \{x_i : 1 \le i \le n\}$ and $X_1 = \{y_i : 1 \le i \le n\}$.
Consider a tree $T$ defined on $X_1$. We construct a pedigree on
$X_0 \cup X_1$ with the set of extant vertices $X_0$ as follows: we first
take an arbitrary onto map $g$ from $X_0$ to the edge set $E(T)$ of $T$,
and for every $x_k \in X_0$, if $g(x_k) = \{y_i, y_j\}$, then in the
pedigree, $x_k$ is a child of $y_i$ and $y_j$. We count the number of
pedigrees that can be constructed in this manner by considering all
possible mutually non-isomorphic trees $T$, and all possible onto maps
from $X_0$ to $E(T)$. For a fixed tree $T$, there are exactly
$\binom{n}{2}(n-1)!$ onto maps from $X_0$ to $E(T)$. Each map does not
give us a distinct pedigree; in fact, each pedigree constructed this way
is repeated $|\mathrm{aut}\,T|$ times, where $\mathrm{aut}\,T$ is the
automorphism group of $T$. Thus we have



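\[ f(n,1) \ge \sum_{T} \binom{n}{2} \frac{(n-1)!}{|\mathrm{aut}\,T|}, \]

where the summation is over all mutually non-isomorphic trees on $X_1$.
Since $n!/|\mathrm{aut}\,T|$ is the number of labelled trees isomorphic to
a given tree $T$, summing over all mutually non-isomorphic trees gives us

\[ f(n,1) \ge \binom{n}{2} \frac{(n-1)!}{n!}\, n^{n-2} = \frac{(n-1)\, n^{n-2}}{2}, \]

where $n^{n-2}$ is the number of labelled trees on $X_1$, by Cayley's
classic formula [2].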
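
The two counting facts used here can be checked numerically for small n. The sketch below is illustrative only (the function names are ours): it brute-forces the number of maps from an n-set onto an (n-1)-set and compares it with $\binom{n}{2}(n-1)!$, and it evaluates the depth-one bound obtained by replacing the sum of $1/|\mathrm{aut}\,T|$ over unlabelled trees with $n^{n-2}/n!$, as described in the text. The bound is evaluated for $n \ge 3$ only.

```python
from itertools import product
from math import comb, factorial

def count_onto_maps(n):
    """Brute-force count of maps from an n-set onto an (n-1)-set."""
    targets = set(range(n - 1))
    return sum(1 for g in product(range(n - 1), repeat=n) if set(g) == targets)

def onto_formula(n):
    """Choose the pair of elements sharing an image, then match blocks to targets."""
    return comb(n, 2) * factorial(n - 1)

def depth_one_lower_bound(n):
    """C(n,2) * (n-1)! * n^(n-2) / n!, computed with exact integers (n >= 3)."""
    return comb(n, 2) * factorial(n - 1) * n ** (n - 2) // factorial(n)

for n in (3, 4, 5):
    print(n, count_onto_maps(n), onto_formula(n), depth_one_lower_bound(n))
```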

Observe that each vertex in $X_1$ is distinguished in the pedigree, in
the sense that no two vertices in $X_1$ have the same set of children.
This fact is useful to construct distinct pedigrees of arbitrary depth by
repeating the same construction for arcs between $X_1$ and $X_2$, $X_2$
and $X_3$, and so on. Therefore,

\[ f(n,d) \ge \left( \frac{(n-1)\, n^{n-2}}{2} \right)^{d}. \]

Observe also that, since trees are bipartite, the directed graph
constructed is indeed a pedigree by Lemma 1.

The above estimate gives an information-theoretic lower bound of
$(d/2)\log n + o(\log n)$ on the number of segregating sites needed for
reconstructing a pedigree from DNA sequence data. This follows by
the same argument as in [8] and is a slight improvement on the bound
$(d/3)\log n + o(\log n)$ established in that paper.

3. Pedigree reconstruction 

In this section, we examine the question of constructing a pedigree 
from the information obtained from the extant individuals. In bio- 
logical applications, this information is typically provided by (DNA) 
sequence data. It is assumed that the information has been passed 
on to each individual by its parents; and, over generations, the infor- 
mation undergoes a stochastic change that models the evolutionary 
process. Is the information available at all extant individuals sufficient 
to uniquely construct the pedigree of the population? To be precise, 
are there examples of stochastic processes for which we cannot con- 
struct the pedigree, and are there examples of stochastic processes for 
which we can construct the pedigree? 

3.1. A negative result. We begin with a simple Markov process un- 
der which the information at the extant vertices (in the form of binary 
sequences of arbitrary length) is not sufficient to uniquely determine 
the pedigree. 

Suppose $\{u_i : 1 \le i \le p\}$ is the vertex set of a pedigree
$\mathcal{P}$. Suppose that associated with each vertex $u_i$ in the
pedigree $\mathcal{P}$, there is a random variable $U_i$ that takes values
from a finite state space $S$. Let

\[ \mathbb{P}(U_i = a_i \mid U_j = a_j,\ 1 \le j \le p,\ j \ne i) \]

denote the probability that $U_i$ takes the value $a_i$ conditional on the
states of random variables at all other vertices. We assume that

\[ \mathbb{P}(U_i = a_i \mid U_j = a_j,\ 1 \le j \le p,\ j \ne i) = \mathbb{P}(U_i = a_i \mid U_j = a_j,\ U_k = a_k), \]

where $u_j$ and $u_k$ are the parents of $u_i$. Is it possible to construct
the pedigree up to isomorphism given the joint distribution
$\mathbb{P}(U_1 = a_1, U_2 = a_2, \ldots, U_n = a_n)$, where we use the
indices 1 to n for extant vertices?

Consider a symmetric two-state model given by the transition matrix

\[ \begin{pmatrix} a & 1/2 & 1/2 & 1-a \\ 1-a & 1/2 & 1/2 & a \end{pmatrix}, \]

where the columns are indexed by the joint states (00, 01, 10, 11) of the
parents of a vertex, and the rows are indexed by the state (0 or 1) of the
vertex. For example, the entry in the first column and second row says
that the probability that a child is in state 1 conditional on both
parents being in state 0 is $1 - a$.

In the following, we construct non-isomorphic pedigrees $\mathcal{P}$ and
$\mathcal{Q}$, each on two extant vertices $u_1$ and $u_2$, such that the
joint distribution $\mathbb{P}(U_1 = a_1, U_2 = a_2)$, where
$a_i \in \{0, 1\}$, is identical for $\mathcal{P}$ and $\mathcal{Q}$.

(1) Construct two disjoint binary pedigrees $B_i$, $i \in \{1, 2\}$,
    respectively, on extant vertices $u_1$ and $u_2$. The depth of each
    binary pedigree is $t \ge 2$. Let $S_i$, $i \in \{1, 2\}$, be the
    corresponding sets of their founders.

(2) Construct a single intermediate pedigree $\mathcal{P}'$ from $B_i$,
    $i \in \{1, 2\}$, by identifying each vertex in $S_1$ with a unique
    vertex in $S_2$. Construct pedigree $\mathcal{P}$ by adding vertices
    $v$ and $w$ as parents of all founder vertices in the pedigree
    $\mathcal{P}'$.

(3) Construct pedigree $\mathcal{Q}$ as in the above step so that
    $\mathcal{P}$ and $\mathcal{Q}$ are non-isomorphic. This is possible
    when $t \ge 2$.

Figure 2 shows examples of $\mathcal{P}$ and $\mathcal{Q}$ for $t = 2$.

Proposition 1. The pedigrees $\mathcal{P}$ and $\mathcal{Q}$ have the same
joint distribution $\mathbb{P}(U_1 = a_1, U_2 = a_2)$, where
$a_i \in \{0, 1\}$, under the symmetric model described above. Thus the
two pedigrees cannot be distinguished from each other from binary
sequences (of i.i.d. samples) of any finite (or infinite) length.

Proof. First consider a binary pedigree, say $B_1$. Let $k$ of the
vertices in $S_1$ be in state 0. Let $f(k, t)$ denote the probability that
the vertex $u_1$ is in state 0. Suppose $k_1$ of the 0 states occur among
the founders on the left tree, and $k_2$ occur on the right tree, where
the left tree and the right tree are the pedigrees of the two parents of
$u_1$. Therefore, $k_1 + k_2 = k$. A recurrence for $f(k, t)$ is then
written in terms of $f_1 = f(k_1, t-1)$ and $f_2 = f(k_2, t-1)$:

\[ f(k,t) = a f_1 f_2 + 0.5\,(1 - f_1) f_2 + 0.5\, f_1 (1 - f_2) + (1 - a)(1 - f_1)(1 - f_2), \]

where the four terms correspond to the four possible joint states of the
parents of $u_1$.
Figure 2. Non-isomorphic pedigrees that produce indistinguishable
sequences under the symmetric stochastic model.



It can be verified by induction that the following expression for
$f(k, t)$ solves the recurrence:

\[ f(k,t) = \frac{k}{2^t}\,(2a - 1)^t + \frac{1 - (2a - 1)^t}{2}. \]

Here the independence of $f(k, t)$ on exactly where the zero states occur
among the founders is what is useful in the following.
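
The independence from the positions of the zero states can be checked exhaustively for small depths. The sketch below is illustrative: it propagates the recurrence down a complete binary pedigree for every assignment of founder states and compares the result with the expression $\frac{k}{2^t}(2a-1)^t + \frac{1-(2a-1)^t}{2}$, which is the closed form we take the recurrence to have (it follows because the recurrence simplifies to $f = (a - 1/2)(f_1 + f_2) + (1 - a)$). Function names and the use of exact fractions are our choices.

```python
from fractions import Fraction
from itertools import product

def child_zero_prob(f1, f2, a):
    """P(child = 0) when the two parents are independently 0 with prob. f1, f2."""
    half = Fraction(1, 2)
    return (a * f1 * f2 + half * (1 - f1) * f2
            + half * f1 * (1 - f2) + (1 - a) * (1 - f1) * (1 - f2))

def extant_zero_prob(founder_states, a):
    """Propagate down a complete binary pedigree with the given founder states."""
    level = [Fraction(1 - s) for s in founder_states]  # P(state 0) per founder
    while len(level) > 1:
        level = [child_zero_prob(level[i], level[i + 1], a)
                 for i in range(0, len(level), 2)]
    return level[0]

def closed_form(k, t, a):
    r = (2 * a - 1) ** t
    return Fraction(k, 2 ** t) * r + (1 - r) / 2

def check(t, a):
    """True if the recurrence matches the closed form for every founder assignment."""
    return all(extant_zero_prob(s, a) == closed_form(s.count(0), t, a)
               for s in product([0, 1], repeat=2 ** t))
```

For instance, `check(3, Fraction(7, 10))` compares all 256 founder assignments at depth 3.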

Now consider the intermediate pedigree $\mathcal{P}'$ and consider the
event $E_k$ that exactly $k$ of its founders are in state 0 (so
$k \in \{0, 1, 2, 3, 4\}$ when $t = 2$). The conditional probability
$\mathbb{P}(U_1 = a_1, U_2 = a_2 \mid E_k)$ is given by

\[ \mathbb{P}(U_1 = a_1, U_2 = a_2 \mid E_k) = \mathbb{P}(U_1 = a_1 \mid E_k)\, \mathbb{P}(U_2 = a_2 \mid E_k), \]

where each factor is either $f(k, t)$ or $1 - f(k, t)$ depending on
whether $a_i$ is 0 or 1, respectively. This is also true in
$\mathcal{Q}'$.

The vertices $v$ and $w$ are added to both intermediate pedigrees as
parents of vertices in $S_1$ and $S_2$ so as to guarantee that all
possible joint states on $S_1$ that have $k$ zeros are equally likely.
This implies that for any given joint distribution on $v$ and $w$, we
have the same joint distribution on $U_1$ and $U_2$ in $\mathcal{P}$ and
$\mathcal{Q}$. □

We now show that exponentially many mutually non-isomorphic 
pedigrees can be obtained by this construction. 
Proposition 2. The number of mutually non-isomorphic pedigrees that
can be obtained by the above construction grows super-exponentially
with t.

Proof. Consider two disjoint binary pedigrees $B_i$ of depth $t \ge 2$, on
extant vertices $u_i$, and founder sets $S_i$, where $i \in \{1, 2\}$. Let
$|S_i| = 2^t = m$. There are $m!$ ways of identifying vertices in $S_2$
with vertices in $S_1$, but not all of them result in mutually
non-isomorphic pedigrees. Consider a pedigree $\mathcal{P}'$ obtained by
identifying vertices in $S_2$ with vertices in $S_1$. The automorphism
group of $\mathcal{P}'$ is a subgroup of the automorphism group of $B_1$.
But $|\mathrm{aut}\,B_1|$ is $2^{m-1}$, therefore
$|\mathrm{aut}\,\mathcal{P}'| \le 2^{m-1}$. Therefore, the number of
mutually non-isomorphic pedigrees obtained by identifying vertices of
$S_2$ with vertices in $S_1$ is at least

\[ \frac{m!}{2^{m-1}}, \]

which implies the claim. □

3.2. Positive results. We first describe a simple deterministic pro- 
cess, and a related stochastic variation, under which the information 
available at the extant individuals is sufficient to construct the pedi- 
gree. We then describe a Markov model that comes closer to the 
mutation-recombination setting of genetic ancestry, for which pedigree 
reconstruction is also possible. This last model should be viewed as
a proof-of-concept, rather than as a realistic process that captures all
aspects of evolution.

Example 1 (Deterministic process). Suppose each founder in the popu- 
lation has a distinct label. Consider an individual whose parents are 
labelled Y and Z. Suppose that each individual inherits the labels 
of its parents, and also has its own unique character that is not seen 
before in any other individual. In this way we assign the individual a 
label {{Y, Z}, X}, where X is a new symbol or a trait that no other
individual in the population possesses, except for descendants of the
individual under consideration, who inherit X in the manner described.

From the labels of the extant individuals, the pedigree is uniquely 
constructed in a straightforward manner. First we construct the
pedigree of each extant individual. Each individual's label uniquely
determines the labels of its parents and the new character that has arisen
in the population for the first time. We recursively construct a binary
tree of parents, grandparents, ... beginning with an extant individual.
After constructing the binary tree, we identify vertices that have the 
same labels. Such vertices are ancestors to whom there are multiple 
paths from the extant individual. 
The next step is to construct a (graph theoretic) union of pedigrees
of all extant individuals. In constructing the graph theoretic union, 
vertices in different pedigrees that have the same labels are identified, 
and multiple arcs between two vertices are suppressed to leave a single 
arc between them. This completes the construction. 
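
The reconstruction in Example 1 can be sketched in a few lines. The representation below is an assumption made for illustration: a founder's label is a string, and the label {{Y, Z}, X} of any other individual is stored as the pair `(frozenset({label_Y, label_Z}), "X")`. Because each individual's own symbol is unique, it serves as the vertex name, and using sets for vertices and arcs performs the identification of equal labels and the suppression of multiple arcs.

```python
def own_symbol(label):
    """A founder's label is its symbol; otherwise label = (parent labels, symbol)."""
    return label if isinstance(label, str) else label[1]

def reconstruct(extant_labels):
    """Return (vertices, arcs) of the pedigree encoded by the extant labels.

    Arcs are (parent, child) pairs of own symbols.
    """
    vertices, arcs = set(), set()
    stack = list(extant_labels)
    while stack:
        label = stack.pop()
        child = own_symbol(label)
        if child in vertices:
            continue  # same label means same individual: already expanded
        vertices.add(child)
        if isinstance(label, str):
            continue  # founder: no parents to expand
        for parent_label in label[0]:
            arcs.add((own_symbol(parent_label), child))
            stack.append(parent_label)
    return vertices, arcs

# Hypothetical example: founders A, B, C; D and E are half-siblings via B;
# the extant individuals F and G are full siblings with parents D and E.
D = (frozenset({"A", "B"}), "D")
E = (frozenset({"B", "C"}), "E")
F = (frozenset({D, E}), "F")
G = (frozenset({D, E}), "G")
vertices, arcs = reconstruct([F, G])
```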

Example 2 (Semi-deterministic process on the integers). Now we modify
Example 1 so as to introduce some randomness, and also to work
over a fixed state space (the integers). Let $N$ be a large positive
integer (sufficiently large relative to the number of vertices in the
pedigree, in a sense that will be made more precise shortly). To each
individual $i$ in the pedigree we first associate an independent random
variable $Y_i$ that takes a value selected uniformly at random from
$\{1, \ldots, N\}$. We then assign a random state $X_i$ to each vertex $i$
of the pedigree as follows. If $i$ is a founder, then set $X_i = Y_i$.
Otherwise, if $i$ has parents $j$ and $k$ then set

\[ X_i = 2^{X_j + N} + 2^{X_k + N} + Y_i. \]

Observe that this process is Markovian (the state at a vertex depends
just on the states at the parents, and not on earlier ancestors).
Moreover, if the random variables $Y_i$ take distinct values, then the
pedigree can be uniquely constructed since $2^{a+N} + 2^{b+N} + m$ can be
uniquely 'decoded' as $\{\{a, b\}, m\}$. If there are $n$ vertices in the
pedigree (and $N > n$) the probability that each random variable takes a
distinct value is

\[ \frac{N(N-1)\cdots(N-n+1)}{N^n}, \]

which approaches 1 as $N$ tends to infinity.

Therefore, under this process, a pedigree can be uniquely reconstructed
by observing the random variables at the extant vertices, with
a probability approaching 1 as $N$ tends to infinity.
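
Both ingredients of Example 2 are easy to experiment with. The sketch below is illustrative and rests on an assumption about the encoding: that a non-founder state has the form $2^{a+N} + 2^{b+N} + m$ with distinct positive integers $a$, $b$ (the parental states) and $1 \le m \le N$. Under that assumption $m < 2^{N+1}$ lies strictly below both powers of two, so the two exponents and $m$ can be read off the binary expansion. The function names are ours.

```python
from fractions import Fraction

def prob_all_distinct(N, n):
    """N(N-1)...(N-n+1) / N^n: chance that n uniform draws from {1..N} differ."""
    p = Fraction(1)
    for i in range(n):
        p *= Fraction(N - i, N)
    return p

def encode(a, b, m, N):
    """Assumed form of a non-founder state with parental states a != b."""
    return 2 ** (a + N) + 2 ** (b + N) + m

def decode(x, N):
    """Recover ({a, b}, m) from x, assuming a != b, a, b >= 1 and 1 <= m <= N."""
    hi = x.bit_length() - 1        # exponent of the larger power of two
    rest = x - (1 << hi)
    lo = rest.bit_length() - 1     # exponent of the smaller power of two
    m = rest - (1 << lo)
    return frozenset({hi - N, lo - N}), m
```

Note that the states grow as a tower of exponentials with depth, so this decoding is practical only for very shallow toy pedigrees; the example is a proof of principle, not a data structure.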

Although the above examples seem to be far removed from the
reality of biological evolution, the concept underlying the examples is
almost unrecognisably hidden in the following setting where the main
consideration is to construct a process that models sequence evolution.

4. A stochastic process on sequences that allows reconstruction

The process of inheriting genetic material from parents may be
conceptualised as follows. Suppose the parents Y and Z of an individual
X have sequences $\{y_i;\ i = 1, 2, \ldots\}$ and
$\{z_i;\ i = 1, 2, \ldots\}$, respectively.






Here the sequences are assumed to be sequences of characters drawn
from $[N] = \{1, 2, \ldots, N\}$. We assume that the sequence $\{x_i\}$ of
X is constructed by copying segments of sequences $\{y_i\}$ and $\{z_i\}$
so that roughly half the genetic material is inherited from one parent,
and roughly half from the other parent. In addition to the directly copied
bits and pieces from its parents' genetic sequences, X also has in its
sequence occurrences of segments that are not (recognised as) copies of
segments of $\{y_i\}$ and $\{z_i\}$. We suppose that the X-specific
fragments are constructed from characters drawn from a set
$U_X \subset [N]$, $|U_X| = m$, where $U_X$ is chosen uniformly at random
from the family of all subsets of $[N]$ of cardinality $m$. The process of
construction of the sequence $\{x_i;\ i = 1, 2, \ldots\}$ is then modelled
as in a hidden Markov model. The copying process copies characters from
$\{y_i\}$, and at some step, determined by chance, begins copying
characters from $\{z_i\}$, or begins a random generation of a sequence of
characters chosen from $U_X$. The process of copying from and switching
between $\{y_i\}$, $\{z_i\}$ and $U_X$ continues.

But the segments copied from $\{y_i\}$ and $\{z_i\}$ are in turn partly
inherited from the parents of Y and Z, respectively, and partly from
the Y-specific and Z-specific segments, that is, segments of characters
drawn from $U_Y$ and $U_Z$, respectively.

We model the above description by first defining a one to one corres- 
pondence between pedigrees and a subclass of finite automata that 
emit (to use the HMM terminology) character sequences at the extant 
individuals. We then demonstrate how a sufficiently long emitted se- 
quence determines first the automaton and then the pedigree with high 
probability. 

Without loss of generality, we consider pedigrees with a single extant
vertex, since after constructing all sub-pedigrees having a single
extant vertex, we can construct their graph theoretic union, as in
Example 1. This is discussed further in Remark 1.

4.1. The automaton (directed graph) G, and the mechanism of
sequence emission. Let Q be a pedigree with vertex set V, |V| = n,
with a single extant vertex x. The automaton associated with Q is
denoted by a directed graph G on the vertex set V. For convenience,
we have used the same vertex set V; so to avoid ambiguity, we denote
an arc from y to z in Q by yz, and an arc from y to z in G by (y, z).

The automaton G, its transition probabilities, and the mechanism 
by which it emits characters in the sequence of the extant vertex are 
defined so that the following conditions are satisfied. 
(1) Let $[\delta_1, \delta_2] \subset [0, 1]$ and
    $[\Delta_1, \Delta_2] \subset [0, 1]$ be two intervals such that
    $\delta_i$ are much smaller than $\Delta_j$ for $i, j \in \{1, 2\}$.

(2) For each internal vertex $y$ (that is, a vertex that is neither a
    founder vertex nor the extant vertex), there are two arcs $(y, u)$
    and $(y, v)$ to its parents $u$ and $v$, respectively, an arc
    $(y, x)$ to the extant vertex $x$, and a self-loop. We assume that
    the transition probabilities satisfy

        $p(y, u),\ p(y, v) \in [\Delta_1, \Delta_2]$

    and

        $p(y, x),\ p(y, y) \in [\delta_1, \delta_2]$.

(3) For the extant vertex $x$, there are outgoing arcs $(x, y)$ and
    $(x, z)$ to its parents, $y$ and $z$, respectively, and a self-loop,
    with the corresponding transition probabilities given by

        $p(x, y),\ p(x, z) \in [\Delta_1, \Delta_2]$

    and

        $p(x, x) + p(x, y) + p(x, z) = 1$.

(4) From a founder vertex $z$, there is one arc $(z, x)$ to the extant
    vertex $x$, and a self-loop. The transition probabilities satisfy

        $\delta_1 \le p(z, x) \le \delta_2$

    and

        $p(z, x) + p(z, z) = 1$.

(5) Each vertex $y$ of the automaton corresponds to a subset $U_y$
    of $[N]$, such that $|U_y| = m > 1$, and $U_y$ is chosen randomly
    from a uniform distribution on the family of subsets of $[N]$ of
    cardinality $m$. The character sequence for $x$ is emitted by the
    automaton as follows: the automaton defines a Markov chain
    with transition probabilities defined above; when the chain is
    in state $y$ (that is, at vertex $y$ of the automaton), a character
    from $U_y$ is emitted from a uniform distribution on $U_y$, $y \in V$.

The assumption that $\delta_i$ are much smaller than $\Delta_j$ for
$i, j \in \{1, 2\}$, and the conditions listed above imply that an
individual derives most of its genetic material from its parents, who in
turn receive most of their genetic material from their parents.
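
The conditions above can be turned into a small simulator. The following is a minimal sketch under stated assumptions, not the authors' construction: all low probabilities are set to a single value `delta`, the high probabilities to $(1 - 2\,\texttt{delta})/2$, and the chain is started at the extant vertex (the text does not specify an initial state). The function names and the dictionary representation of the pedigree (each non-founder mapped to its pair of parents) are hypothetical.

```python
import random

def build_automaton(parents, extant, delta=0.02):
    """Transition probabilities of the automaton G for a single-extant pedigree.

    parents maps each non-founder vertex to its pair of parents.
    """
    big = (1 - 2 * delta) / 2  # assumed common value of the high probabilities
    vertices = set(parents) | {p for pair in parents.values() for p in pair}
    trans = {}
    for y in vertices:
        if y not in parents:                 # founder: arc to x and a self-loop
            trans[y] = {extant: delta, y: 1 - delta}
        elif y == extant:                    # extant: arcs to parents, self-loop
            u, v = parents[y]
            trans[y] = {u: big, v: big, y: 1 - 2 * big}
        else:                                # internal vertex
            u, v = parents[y]
            trans[y] = {u: big, v: big, extant: delta, y: delta}
    return trans

def random_char_sets(vertices, N, m, rng):
    """Draw an m-subset U_y of {1..N} uniformly at random for every vertex."""
    return {y: rng.sample(range(1, N + 1), m) for y in sorted(vertices)}

def emit(trans, char_sets, extant, length, rng):
    """Emit `length` characters, starting the chain at the extant vertex."""
    state, out = extant, []
    for _ in range(length):
        out.append(rng.choice(char_sets[state]))
        targets = list(trans[state])
        weights = [trans[state][s] for s in targets]
        state = rng.choices(targets, weights=weights)[0]
    return out

# Hypothetical 6-vertex pedigree: x has parents y, z; founders a, b, c.
parents = {"x": ("y", "z"), "y": ("a", "b"), "z": ("b", "c")}
trans = build_automaton(parents, "x")
rng = random.Random(0)
sets = random_char_sets(trans, N=1000, m=3, rng=rng)
seq = emit(trans, sets, "x", 500, rng)
```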

Figure 3 shows a pedigree Q on 6 vertices and an automaton G
that corresponds to the pedigree Q. The transition probabilities in the
figure are denoted by $\Delta_{ij}$ or $\delta_{ij}$ instead of $p(i, j)$
so as to indicate their relative magnitudes.
Figure 3. A pedigree and a corresponding automaton.



We are interested in the following question: does a sufficiently long
sequence $\{x_i;\ i = 1, 2, \ldots\}$ emitted by the automaton determine
the pedigree unambiguously with high probability? Since the correspondence
between the subclass of automata and pedigrees with a single extant
vertex is one-to-one, the question is equivalent to asking if the
automaton can be constructed unambiguously. The main result of this
section is the affirmative answer to this question, formulated in the
following theorem. Note that although it deals with only a single extant
vertex, we describe in Remark 1 how it extends to the general case of
a pedigree over a finite set X.

Theorem 1. Let Q be a pedigree having a single extant vertex. Let Q
be associated with an automaton G that satisfies the conditions listed
above. Let $S_k = \{x_i;\ i = 1, 2, \ldots, k\}$ be a sequence of
characters from the set $[N] = \{1, 2, \ldots, N\}$, emitted by the
automaton (as in the fifth condition above). Then for sufficiently large
$k$ and $N$, the automaton G and the pedigree Q can be correctly
reconstructed (with high probability) from the sequence $S_k$.

The theorem follows from the several lemmas proved next. 
Lemma 2. Given an automaton G with its transition probabilities, the
pedigree Q can be uniquely constructed. 

Proof. This follows from the relative magnitudes of the probabilities of
transition. For distinct vertices $u$ and $v$ in G, the transition
probability from $u$ to $v$ is high (that is, in the interval
$[\Delta_1, \Delta_2]$) if and only if $v$ is a parent of $u$ in the
pedigree Q. For a vertex $u$, the probability of transition from $u$ to
itself is high if and only if $u$ is a founder vertex. A vertex $u$ is the
extant vertex of Q if and only if there is no other vertex $v$ in G such
that the probability of transition from $v$ to $u$ is high. □

Next we must construct the automaton G from the sequence $S_k$.
The idea of inference of the automaton G from the sequence $S_k$ is
based on the following observation. Suppose $i, j \in [N]$ are such that
there is only one $U_y$ that contains $i$, and only one $U_z$ that
contains $j$. Then the observed transition probability $p(i \mid j)$ in
the sequence $S_k$ is in the range $[\Delta_1/m, \Delta_2/m]$ if $y$ is a
parent of $z$; and is in the range $[\delta_1/m, \delta_2/m]$ if
$i \in U_x$ and $j \in U_y$, or if $\{i, j\} \subset U_y$, where $y$ is an
internal vertex. Similarly, one can argue about the magnitude of the
observed frequency of $i$ followed by $j$ in $S_k$ for founder vertices,
and for the extant vertex. What matters is whether the estimated
probability is high (of the order of $\Delta_i/m$; $i = 1, 2$) or low (of
the order of $\delta_i/m$; $i = 1, 2$). The transition probabilities
$p(i \mid j)$ can be estimated as accurately as desired by choosing
sufficiently large $k$. It is crucial for the above argument that each
$U_y$ contains some state $i$ that is unique to $U_y$, that is, $i$ does
not belong to a $U_z$ for $z$ other than $y$. This is the case with
high probability for large $N$, as made precise in the following lemma.

Lemma 3. Suppose that the sets Uy are randomly chosen from a uni- 
form distribution on the family of subsets of [N] of cardinality m. Let 
E be the event that each Uy contains at least one i that is not in any 
other Uz. The probability of this event E approaches 1 as N tends to 
infinity. 

Proof. Let $E_i$ be the event that $U_i$ is not a subset of
$\bigcup_{j \ne i} U_j$. Then $E = \bigcap_{i=1}^{n} E_i$, and by Boole's
inequality [3], and symmetry,

\[ \mathbb{P}(E) \ge 1 - \sum_{i=1}^{n} \mathbb{P}(E_i^c) = 1 - n\,\mathbb{P}(E_1^c), \]

where the superscript $c$ denotes complement. Now $E_1^c$ is the event
that $U_1$ is a subset of $U_2 \cup U_3 \cup \ldots \cup U_n$, and clearly
the probability of this (complementary) event is maximised if
$U_2, \ldots, U_n$ are disjoint. In this case
$|U_2 \cup \ldots \cup U_n| = (n-1)m$, and so $\mathbb{P}(E_1^c)$ is
bounded above by the proportion of subsets of $[N]$ of size $m$ that are
subsets of a set of size $(n-1)m$, i.e.

\[ \mathbb{P}(E_1^c) \le \binom{(n-1)m}{m} \Big/ \binom{N}{m}. \]

This, along with the above inequality, implies $\mathbb{P}(E) \to 1$ as
$N \to \infty$. □
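
To get a feel for how large $N$ must be, the bound from the proof can be evaluated directly. The sketch below is illustrative; it computes $1 - n\binom{(n-1)m}{m}/\binom{N}{m}$, the lower bound on the probability of $E$ obtained by combining Boole's inequality with the subset-counting estimate described in the proof. The function name is ours, and the bound is vacuous (negative) when $N$ is small.

```python
from math import comb

def unique_char_lower_bound(N, n, m):
    """Lower bound on P(E): every U_i has an element found in no other U_j."""
    return 1 - n * comb((n - 1) * m, m) / comb(N, m)

# Hypothetical sizes: n = 6 vertices, m = 3 characters per vertex.
for N in (100, 1000, 10000):
    print(N, unique_char_lower_bound(N, 6, 3))
```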



Let Ui C [N]]i = l,2,...n be the unknown character sets corre- 
sponding to the vertices 1,2,. . . ,n of the automaton. Let Ui denote 
the subset of Ui consisting of those elements that are unique to U, that 
is. 



By a recursive procedure, we construct U, and the pedigree Q on 
the vertex set [n] = {1, 2, . . . , n}. 

Without a loss of generality, assume that the extant vertex is labelled 
1, and the founder vertices are labelled from / to n. 

We first construct a directed graph H from the observed sequence 
Xi;i = 1,2,.... The vertex set V{H) of H is the set of states that 
appear in the emitted sequence Xi]i = 1, 2, . . .. The set of arcs of H 
is E{H), and an arc {u,v) is in E{H) if a transition from u to f is 
observed in Xi]i = 1,2, .. ., that is, if there is some i for which Xi = u 
and Xi+i = V. Each arc {u, v) of H is labelled high or low depending 
on whether the inferred probability p{v\u) of transition from n to is 
of the order of A/m or 5/m, respectively, where Ai < A < A2 and 
Si < 5 < 62- The inferred probabilities will be distinguishable as high 
or low for sufficiently long emitted sequences. 

Let d'l{u) and d^{u) denote the number of outgoing arcs from u that 
are labelled high and low, respectively. We count each self-loop as a 
single arc. 

Lemma 4. The sets $U_i$ and $\bar{U}_i$ for founder vertices can be constructed.

Proof. Suppose $i$ is a founder vertex. Then from a state $u$ in $\bar{U}_i$, there
are precisely $m$ transitions with high probability. On the other hand,
if $i$ is not a founder vertex, then it has parents $j$ and $k$; therefore, from
a state $u$ in $U_i$, there are at least $|U_j \cup U_k| \geq m + 1$ outgoing arcs
that are labelled high. Observe also that if $i$ is a founder vertex, and $u$
is in $U_i$ but not in $\bar{U}_i$, then there will be at least $m + 1$ outgoing arcs
from $u$ that are labelled high, since $u$ will also be in some other $U_j$ in
that case. Therefore, $u$ is in $\bar{U}_i$ for some founder vertex $i$ if and only if
$d^h(u) = m$. The set of all such vertices in $H$ naturally partitions into
blocks, one block $\bar{U}_i$ for each founder $i$, since if $\bar{U}_i$ and $\bar{U}_j$ correspond
to two founders, and $u \in \bar{U}_i$ and $v \in \bar{U}_j$, then there will be transitions
from $u$ to $v$ and from $v$ to $u$ in the emitted sequence if and only if
$\bar{U}_i = \bar{U}_j$. Once $\bar{U}_i$ is known for each founder $i$, we can construct $U_i$ as
well: if there is an arc $(u, v)$ that is labelled high for a state $u$ in $\bar{U}_i$
and a state $v$ not in $\bar{U}_i$, where $i$ is a founder vertex, then $v$ must be in
$U_i$. □
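
The base case in Lemma 4 can be sketched as below. The function name `founder_sets` is hypothetical, and the toy graph contains only the rows for states 6, 7 and 8, filled in from the arcs mentioned in the example of Section 4.2; the self-loops on those rows are an assumption made so that each row has exactly $m = 3$ high arcs.

```python
def founder_sets(H, m):
    """Return a list of (unique_block, full_set) pairs, one per founder.

    H maps each state u to {v: 'h' or 'l'}; m is the character-set size.
    """
    def high_targets(u):
        return {v for v, label in H.get(u, {}).items() if label == "h"}

    # States unique to a founder are exactly those with d_high(u) == m.
    candidates = [u for u in H if len(high_targets(u)) == m]

    # Two candidates share a block iff arcs exist in both directions.
    # Comparing with one representative suffices if, as the proof asserts,
    # this relation partitions the candidates.
    blocks = []
    for u in candidates:
        for block in blocks:
            rep = next(iter(block))
            if rep in H[u] and u in H[rep]:
                block.add(u)
                break
        else:
            blocks.append({u})

    # The full set U_i collects every high target of a state in the block.
    result = []
    for block in blocks:
        full = set(block)
        for u in block:
            full |= high_targets(u)
        result.append((frozenset(block), frozenset(full)))
    return result


if __name__ == "__main__":
    H = {  # partial toy graph; self-loops are assumed
        6: {6: "h", 10: "h", 11: "h"},
        7: {7: "h", 8: "h", 9: "h"},
        8: {7: "h", 8: "h", 9: "h"},
    }
    print(founder_sets(H, 3))
```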

In general, for vertices other than founders, we will be interested in
constructing only $\bar{U}_i$.

We treat the above construction as the base case of a recursive
procedure for constructing all $\bar{U}_i$.

Let $\mathcal{F} = \{\bar{U}_i\}$ be the collection that has been constructed so far.
At the end of the base case, each $\bar{U}_i$, $i \geq f$, is in $\mathcal{F}$. The construction
proceeds in a top-down manner; so if $j$ and $k$ are parents of $i$, and if
$\bar{U}_i$ is in $\mathcal{F}$, then $\bar{U}_j$ and $\bar{U}_k$ have already been constructed and added
to $\mathcal{F}$. Let $U_{\mathcal{F}}$ denote the union over all sets in $\mathcal{F}$.

Let $\bar{U}_j$ and $\bar{U}_k$ be any two distinct sets in $\mathcal{F}$ such that $\bar{U}_i$ for children
$i$ with parents $j$ and $k$ have not been constructed so far.

Let $T_{jk}$ be the set of states $u$ for which the following conditions hold:

(1) $u$ is not in $U_{\mathcal{F}} \cup \bigcup_{r \geq f} U_r$, and

(2) there is a high arc $(u, w)$ in $H$ for every $w$ in $\bar{U}_j \cup \bar{U}_k$.

Lemma 5. If a state $u$ is in $T_{jk}$ then it is in $U_i$ for some child $i$ with
parents $j$ and $k$. If a state $u$ is in $\bar{U}_i$ for some child $i$ with parents $j$
and $k$ then $u$ is in $T_{jk}$.

Proof. When the second condition holds it is possible that $u$ is in $U_j \cup U_k$
and both $j$ and $k$ are founders. But this possibility is eliminated by
the first condition. Therefore $u$ must be in $U_i$ for some child $i$ with
parents $j$ and $k$. The second statement is then obvious. □

The above lemma implies that

$$\bigcup_i \bar{U}_i \subseteq T_{jk} \subseteq \bigcup_i U_i,$$

where the unions are over the children of $j$ and $k$.

Lemma 6. Let $u$ be a state in $T_{jk}$. If $u$ is in $\bar{U}_i$ for some child $i$ with
parents $j$ and $k$ then $d^h(u) = |U_j \cup U_k|$ (which may not be known).
If $u$ is not in $\bar{U}_i$ for any child $i$ with parents $j$ and $k$, then
$d^h(u) \geq |U_j \cup U_k| + 1$.

Proof. The first statement follows from the fact that $u$ is not in any
other set $U_r$, and the second statement follows from the fact that $u$ is
in $U_i$ for some child $i$ with parents $j$ and $k$ and in at least one other
$U_r$. □
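
Conditions (1) and (2), together with the out-degree dichotomy of Lemma 6, suggest the following sketch of a single recursive step: collect the candidate set $T_{jk}$, keep the candidates of minimum high out-degree, and group the survivors by mutual arcs, as was done for founders in Lemma 4. The function name, the `excluded` argument (states already assigned or eliminated in earlier steps) and the toy graph are all hypothetical; the toy pedigree has founders with sets {10, 11} and {20, 21}, two children with sets {30, 50} and {31, 32}, and a grandchild with set {50, 60}.

```python
def recursive_step(H, excluded, unique_j, unique_k):
    """Return (blocks, rejected) for the parents whose unique sets are given.

    H maps each state u to {v: 'h' or 'l'}. excluded holds the states ruled
    out by condition (1) or eliminated in earlier steps (an assumption about
    the bookkeeping, modelled on the worked example).
    """
    def high_targets(u):
        return {v for v, label in H.get(u, {}).items() if label == "h"}

    targets = set(unique_j) | set(unique_k)
    # Candidate set T_jk: not excluded, and a high arc to every target.
    T = [u for u in H if u not in excluded and targets <= high_targets(u)]
    if not T:
        return [], []

    # Lemma 6: states unique to a child attain the minimum high out-degree.
    d_min = min(len(high_targets(u)) for u in T)
    survivors = [u for u in T if len(high_targets(u)) == d_min]
    rejected = [u for u in T if len(high_targets(u)) > d_min]

    # Group survivors that have arcs to each other in both directions.
    blocks = []
    for u in survivors:
        for block in blocks:
            rep = next(iter(block))
            if rep in H[u] and u in H[rep]:
                block.add(u)
                break
        else:
            blocks.append({u})
    return blocks, rejected


if __name__ == "__main__":
    parents_high = {10: "h", 11: "h", 20: "h", 21: "h"}
    H = {  # hypothetical toy graph
        30: {**parents_high, 30: "l", 50: "l"},
        31: {**parents_high, 31: "l", 32: "l"},
        32: {**parents_high, 31: "l", 32: "l"},
        50: {**parents_high, 30: "h", 50: "h", 31: "h", 32: "h"},
        60: {30: "h", 50: "h", 31: "h", 32: "h", 60: "l"},
    }
    print(recursive_step(H, {10, 11, 20, 21}, {10, 11}, {20, 21}))
```

The rejected states are returned so that a caller can add them to `excluded` before processing the next pair of parents.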

Corollary 1. The set $\bar{T}_{jk} = \bigcup_i \bar{U}_i$, where the union is over children $i$
of $j$ and $k$, is recognised.

Proof. The set $\bar{T}_{jk}$ is the set of states $u$ in $T_{jk}$ for which $d^h(u)$ is
minimum. □

Lemma 7. The set $\bar{T}_{jk}$ partitions into blocks $\bar{U}_i$ for children $i$ with
parents $j$ and $k$.

Proof. States $v$ and $w$ in $\bar{T}_{jk}$ are in the same block if and only if there are
arcs $(v, w)$ and $(w, v)$ in $H$. □

This construction terminates when no more blocks can be added to
$\mathcal{F}$, thus completing the proof of the theorem.

Remark 1. In the above construction we recognised $\bar{U}_i$ for all vertices in
the pedigree. We also recognised the parent-child relationships between
them, which allowed us to construct the whole pedigree on the single
extant vertex. Now suppose that we have a pedigree on more than one
extant individual. For each extant vertex we have a sequence emitted
by the automaton that corresponds to the sub-pedigree on that extant
vertex. It is reasonable to suppose that each vertex $i$ in the pedigree
corresponds to a unique $U_i \subseteq [N]$. Such a supposition means that the
extant individuals that are descendants of $i$ (the cluster of $i$) share some
common traits, and the states in $U_i$ are observed only in the sequences
of the extant individuals in the cluster of $i$. We, therefore, construct
the pedigree of each extant individual separately. To construct a graph
theoretic union of all these pedigrees, we identify vertices $y$ and $z$,
respectively, in pedigrees $\mathcal{P}_i$ and $\mathcal{P}_j$ whenever $U_y$ and $U_z$ are identical.
It is possible to generalise the correspondence between pedigrees and
automata that was considered above to a correspondence between
pedigrees on multiple extant vertices and more general automata in which
there are transitions from a vertex either to its parents or to itself or to
any of its extant descendants. The mechanism for emitting characters
would not be essentially different. For example, when the automaton
is in state $v$ (that is, at vertex $v$), it would emit characters from $U_v$ at
all its descendants.

4.2. Example. We now illustrate the above construction with an
example. The matrix $H$ below represents the directed graph $H$ that was
defined earlier. Thus its vertex set is the set of states observed in the
emitted sequence, which in our example is $\{1, 2, \ldots, 14\}$. The arcs of
$H$ are labelled $h$ (high) or $l$ (low).

Observe that the rows 6, 7 and 8 have the minimum number 3 of
$h$ entries; therefore, $m = 3$, and $\bigcup_i \bar{U}_i = \{6, 7, 8\}$, where the union is over
the indices of the founders. Also, observe the block structure of the
sub-matrix consisting of rows and columns 6, 7 and 8: there are no
arcs from 6 to 7 or 8, and no arcs from 7 or 8 to 6, but there are arcs
between 7 and 8. Therefore, there are two founders in the pedigree.
There are outgoing arcs (6, 10) and (6, 11) that are labelled $h$; therefore,
the character set for one of the founders is $U_f = \{6, 10, 11\}$. Similarly,
the character set for the other founder is $U_g = \{7, 8, 9\}$. We have called
them $U_f$ and $U_g$ since we do not know how many vertices are in the
pedigree; but the naming is not relevant. We now set
$\mathcal{F} = \{\bar{U}_f = \{6\}, \bar{U}_g = \{7, 8\}\}$.

We now consider pairs $\bar{U}_j$ and $\bar{U}_k$ in $\mathcal{F}$. In this case there is only one
pair. The matrix $H$ shows 6 states 4, 5, 9, 10, 12, 13 that have high-arcs
to 6 and to $\{7, 8\}$, and are therefore the candidate states for inclusion
in $\bar{U}_i$ for children $i$ of $j$ and $k$. We omit 10 from this list because 10
is in $U_f$ but not in $\bar{U}_f$. We then note that $d^h(4) = d^h(5) = 6$, while
$d^h(9)$, $d^h(12)$, and $d^h(13)$ are all more than 6. Therefore, we eliminate
9, 12 and 13 as well from the list of candidate states. Since there are
no arcs between 4 and 5, the blocks to be included in $\mathcal{F}$ are $\bar{U}_e = \{4\}$
and $\bar{U}_d = \{5\}$. Both $d$ and $e$ are children of $f$ and $g$. Here we also
conclude that since 9, 10, 11, 12 and 13 are in $U_d \cup U_e \cup U_f \cup U_g$, they
cannot be in any $\bar{U}_i$ that will be discovered in future, so they do not
have to be considered.
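
The threshold 6 used in this step is the value $|U_f \cup U_g|$ predicted by Lemma 6, as the following small check illustrates. The sets and the two out-degrees are those stated in the example; the variable and function names are hypothetical.

```python
U_f = {6, 10, 11}  # founder character sets read off in the example
U_g = {7, 8, 9}


def unique_child_high_degree(U_j, U_k):
    """High out-degree of a state unique to a child of j and k (Lemma 6)."""
    return len(U_j | U_k)


threshold = unique_child_high_degree(U_f, U_g)

# High out-degrees that the text states explicitly for states 4 and 5.
stated_high_degree = {4: 6, 5: 6}
kept = sorted(u for u, d in stated_high_degree.items() if d == threshold)

if __name__ == "__main__":
    print(threshold, kept)
```

States 9, 12 and 13 are reported only as having high out-degree greater than 6, so they fall outside the kept list without their exact values being needed.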
Next we have to repeat the process for all pairs of blocks in $\mathcal{F}$ (except
of course the ones which we have already processed in earlier steps).

Consider the pair $\bar{U}_e$ and $\bar{U}_g$. The states 2, 12, 13, 14 have high-arcs
to each state in $\bar{U}_e \cup \bar{U}_g = \{4, 7, 8\}$. But 12 and 13 have been eliminated
before. Since $d^h(2) = d^h(14) = 6$, and there are arcs (2, 14) and (14, 2),
there is only one new block $\bar{U}_c = \{2, 14\}$, and $c$ is a child of $e$ and $g$.

Next we claim that $d$ and $g$ have no child together since only state
13 has high-arcs to all states in $\bar{U}_d \cup \bar{U}_g = \{5, 7, 8\}$, but 13 has been
eliminated earlier. By similar reasoning, we claim that vertices $e$ and
$f$ do not have a child, and vertices $d$ and $f$ do not have a child.

Next we note that the states 3, 11 and 13 have high-arcs to all vertices
in $\bar{U}_d \cup \bar{U}_e = \{4, 5\}$. But 11 and 13 were eliminated earlier. Therefore,
the next block to be added to $\mathcal{F}$ is $\bar{U}_b = \{3\}$.

Only 11 and 13 have high-arcs to all states in $\bar{U}_f$ and $\bar{U}_d$. But 11
is in $U_f$, where $f$ is a founder, and 13 has high-arcs to vertices in $\bar{U}_g$.
Therefore, $d$ and $f$ have no children together.

In the end, we observe that the states 1, 9, and 13 have high-arcs to
states in $\bar{U}_b \cup \bar{U}_c$, but 9 and 13 were discarded earlier, so we conclude
the construction by adding the block $\bar{U}_a = \{1\}$ to $\mathcal{F}$, which corresponds to
the extant vertex. The resulting pedigree is the one shown on the left 
of Figure [TJ 
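
The outcome of the example can be summarised as a parent map. The sketch below records the parent-child relationships and the unique blocks exactly as they were derived in the steps above; the dictionary layout and the helper `depth` are illustrative choices, not part of the paper.

```python
# Parents of each vertex as found in the example; founders have none.
parents = {
    "a": ("b", "c"),
    "b": ("d", "e"),
    "c": ("e", "g"),
    "d": ("f", "g"),
    "e": ("f", "g"),
    "f": (),
    "g": (),
}

# Blocks of states unique to each vertex, as added to the collection F.
unique_blocks = {
    "a": {1}, "b": {3}, "c": {2, 14}, "d": {5},
    "e": {4}, "f": {6}, "g": {7, 8},
}

founders = {v for v, ps in parents.items() if not ps}
has_child = {p for ps in parents.values() for p in ps}
extant = set(parents) - has_child  # vertices that are nobody's parent


def depth(v):
    """Length of the longest chain of ancestors above v."""
    return 0 if not parents[v] else 1 + max(depth(p) for p in parents[v])


if __name__ == "__main__":
    print(sorted(founders), sorted(extant), depth("a"))
```

The nine states in the unique blocks, together with the five eliminated states 9 to 13, account for all 14 observed states.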



References 

[1] R.L. Cann, M. Stoneking, and A.C. Wilson. Mitochondrial DNA and human 
evolution. Nature, 325:32-36, 1987. 

[2] A. Cayley. A theorem on trees. Quarterly Journal of Mathematics, Oxford Se- 
ries, 23:376-378, 1889. Also in "The Collected Mathematical Papers of Arthur 
Cayley," Vol. XIII, pp. 26-28, Cambridge University Press, Cambridge, UK, 
1897. 

[3] International Human Genome Sequencing Consortium. Initial sequencing and 
analysis of the human genome. Nature, 2001. 

[4] J. Felsenstein. Inferring Phylogenies. Sinauer Press, 2004.

[5] G.R. Grimmett and D.R. Stirzaker. Probability and Random Processes. Oxford 
University Press, New York, third edition, 2001. 

[6] T.A. McKee and F.R. McMorris. Topics in Intersection Graph Theory. SIAM 
Monographs on Discrete Mathematics and Applications. Society for Industrial 
and Applied Mathematics (SIAM), Philadelphia, PA, 1999. 

[7] D.L.T. Rohde, S. Olson, and J.T. Chang. Modelling the recent common an- 
cestry of all living humans. Nature, 431(7008), 2004. 

[8] M. Steel and J. Hein. Reconstructing pedigrees: a combinatorial perspective. 
Journal of Theoretical Biology, 2006. 

[9] B.D. Thatte. Combinatorics of pedigrees. Preprint arXiv:math.CO/0609264,
16 pages, 2006.



[10] E.A. Thompson. Statistical inference from genetic data on pedigrees. NSF- 
CBMS Regional Conference Series in Probability and Statistics, 6. Institute of 
Mathematical Statistics, Beachwood, OH, 2000.

[11] J.C. Venter et al. The sequence of the human genome. Science, 291:1304-1353, 
2001. 

Biomathematics Research Centre, Mathematics and Computer Science
Building, University of Canterbury, Private Bag 4800, Christchurch,
New Zealand

E-mail address, Bhalchandra D. Thatte: bdthatte@gmail.com

E-mail address, Mike Steel: m.steel@math.canterbury.ac.nz



